VOICE COMMUNICATION BY PHONEME RECOGNITION AND TEXT TO SPEECH



FIELD OF THE INVENTION 

This invention relates generally to the field of voice communications and more 
particularly to compression or reduction of data required for voice communications. 

BACKGROUND ART 

Voice communication is typically conducted over the Public Switched
Telephone Network (PSTN), in which a virtual dedicated circuit is established for
each call. In such a circuit, a real-time connection is established that allows two-way
transmission of data during the telephone call. Data communication can also be
performed on such virtual circuits. However, data communication is increasingly
being performed on wide-area data networks, such as the Internet, which provide a
widely available and low-cost shared communications medium. Voice
communication over such data networks is possible and is attractive because of the
potentially lower cost of communicating over data networks, and the simplicity and
lower cost of performing data and voice communications over a single network.
However, the real-time nature of voice communications, coupled with the bandwidth
required for such communication, often makes use of data networks for voice
communication impractical. The bandwidth required for conventional voice
communication also limits the use of services such as video conferencing, which
require significant additional amounts of bandwidth.

Accordingly, there is a need for techniques that reduce the amount of
transmitted data required for voice communications.

SUMMARY OF THE INVENTION 
In a principal aspect, the present invention reduces the amount of data
required to be transmitted for voice communication. In accordance with a first object
of the invention, voice data is transmitted by generating, in response to voice inputs
(110) from a user, speech sample data (112) indicative of a sample of the user's voice.
During a communication session, voice transmission data is generated as a function
of the user's voice spoken during the communication session. The voice transmission
data is then transmitted to a receiving station (101) designated in the communication
session. The user's spoken voice is then recreated at the receiving station as a
function of the speech sample data (112).

Transmission of voice data in such a manner greatly reduces the bandwidth
required for voice communication. Voice communication over data networks
therefore becomes more feasible because the reduced bandwidth helps to alleviate the
latency often encountered in data networks. A further advantage is that the
decreased bandwidth required by voice communications frees bandwidth for
transmission of additional data, such as video data for video conferencing.

These and other features and advantages of the present invention may be 
better understood by considering the following detailed description of a preferred 
embodiment of the invention. In the course of this description reference will be
frequently made to the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 
Figure 1 is a block diagram of voice communication in accordance with the
principles of the present invention.

Figures 2, 3, 4, 5 and 6 are flowcharts illustrating operation of a preferred
embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
In Figure 1, communications devices 101.1 and 101.2 operate in accordance 
with the principles of the present invention to perform two-way voice 
communication across network 102. Communications devices 101.1 and 101.2 are 
shown in Figure 1 as being the same type of device and are referred to herein
collectively as "communications devices 101." The corresponding elements of 
communications devices 101 are also designated by numerical suffixes of .1 and .2 to 
designate correspondence with the appropriate communications device 101.1 or 
101.2. 

Network 102 can take a variety of forms. For example, network 102 can take
the form of a publicly accessible wide area network, such as the Internet.
Alternatively, network 102 may take the form of a private data network such as is found
within many organizations. Alternatively, network 102 may comprise the Public
Switched Telephone Network (PSTN). The exact form of the data network 102 is not
critical; instead, the data network 102 must simply be able to support full-duplex,
real-time communication, at a rate which the user would find acceptable in a PC
remote-control product (e.g. 9600 baud).

Communications devices 101 include a processing engine 104, a storage device
106, and an output device 108, and respond to voice and other inputs 110.
Each communications device 101 also includes the necessary hardware and software to
transmit data to and receive data from network 102. Such hardware and software can
include, for example, a modem and associated device drivers. The processing engine
104 preferably takes the form of a conventional digital computer programmed to
perform the functions described herein. The storage device 106 preferably takes a
conventional form that provides capacity and data transfer rates to allow processing
engine 104 to store and retrieve data at a rate sufficient to support real-time two-way
voice communication. The output device(s) 108 can include a plurality of types of
output devices including visual display screens, and audio devices such as speakers.
Voice and other inputs 110 are entered by way of conventional input devices, such as
microphones for voice inputs, and keyboards and pointing devices for entry of text,
graphical data, and commands.

The communications devices 101 operate generally by accepting voice inputs
110 from a user and generating, in response thereto, a speech sample 112, which
contains symbols indicative of the user's speech. The speech sample 112 preferably
contains a plurality of symbols indicative of the entire range of sounds necessary
to generate, from the user's voice inputs during a phone conversation, a stream
of symbols that can be decoded by a receiving device (such as a communication
station 101) to generate an accurate reproduction of the user's voice inputs. For
example, the speech sample 112 can include all letters of the alphabet, numbers from
0 through 9, and the names of days, weeks and months of the year. In addition,
speech sample 112 can include additional symbols such as certain words that may be
stored with different inflections and additional words, terms, or phrases that may be
unique to a particular user.
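
By way of illustration only, the contents of speech sample 112 described above might be held in a structure such as the following. This is a minimal sketch and not part of the disclosed embodiment: the class name SpeechSample, its fields, and the REQUIRED_SYMBOLS list are hypothetical, and the mapping from a symbol to raw audio bytes is an assumption about how the recorded sounds could be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpeechSample:
    """Hypothetical container for one user's speech sample 112."""
    user_id: str
    # Assumed layout: symbol (letter, digit, word, or phoneme label) -> recorded audio.
    sounds: Dict[str, bytes] = field(default_factory=dict)

    def add(self, symbol: str, audio: bytes) -> None:
        """Store the user's recording for one symbol."""
        self.sounds[symbol] = audio

    def missing(self, required: List[str]) -> List[str]:
        """Return the required symbols that have not been recorded yet."""
        return [s for s in required if s not in self.sounds]


# Illustrative symbol set: letters, digits 0 through 9, weekday and month names.
REQUIRED_SYMBOLS: List[str] = (
    [chr(c) for c in range(ord("a"), ord("z") + 1)]
    + [str(d) for d in range(10)]
    + ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
    + ["january", "february", "march", "april", "may", "june", "july",
       "august", "september", "october", "november", "december"]
)
```

A sample built this way can report which sounds still need to be recorded, for example by calling missing(REQUIRED_SYMBOLS) after each reading of the sample text.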

To converse, the user speaks into an audio input device, and processing engine
104 converts the voice inputs 110 to a stream of symbols that are transmitted to
another communications device across network 102. The stream of symbols that is
transmitted comprises far less data than a conventional digitized stream of a user's
voice. Therefore, a two-way voice conversation can be conducted using significantly
fewer network resources than required for a conventional two-way conversation
conducted by transmission of digitized voice streams. Communications devices 101
operating in accordance with the principles of the present invention therefore require
lower performance networks. Alternatively, in higher performance networks,
communications devices 101 allow other network functions to occur concurrently.
For example, other data may be transmitted on the network 102 while one or more
voice conversations are being conducted. The lower bandwidth utilization of
communications devices 101 also allows other data to be transmitted during the
two-way conversation. For example, the decreased network utilization may allow the
transmission of other data in support of the conversation, such as video data or other
types of data used in certain application programs, such as spreadsheets, word
processing programs, or databases.
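
The scale of the reduction can be illustrated with rough arithmetic. The sketch below compares conventional telephone-quality digitized voice (8000 samples per second at 8 bits per sample, i.e. 64,000 bit/s) with a symbol stream. The symbol rate and symbol size are assumptions chosen for illustration and are not taken from this description; actual figures depend on the speaker and on the symbol encoding used.

```python
# Conventional telephone-quality digitized voice: 8000 samples/s x 8 bits/sample.
PCM_BITS_PER_SECOND = 8000 * 8

# Assumed figures for the symbol stream (illustrative only, not from the source).
SYMBOLS_PER_SECOND = 15   # rough figure for fast conversational speech
BITS_PER_SYMBOL = 8       # one byte per transmitted symbol


def symbol_stream_bps(symbols_per_second: int = SYMBOLS_PER_SECOND,
                      bits_per_symbol: int = BITS_PER_SYMBOL) -> int:
    """Bit rate of the transmitted symbol stream, ignoring protocol overhead."""
    return symbols_per_second * bits_per_symbol


def reduction_factor() -> float:
    """How many times smaller the symbol stream is than the digitized voice stream."""
    return PCM_BITS_PER_SECOND / symbol_stream_bps()
```

Under these assumptions the symbol stream needs on the order of a hundred bits per second, which fits comfortably within the 9600 baud link mentioned above and leaves most of that link free for other data.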

As previously noted, the processing engine 104 preferably takes the form of a
conventional digital computer, such as a personal computer that executes programs
stored on a computer-readable storage medium to perform the functions described.
The functions described herein, however, need not be implemented in software; they
may be implemented in software, hardware, firmware, or a combination thereof.
The flow charts shown in Figures 2, 3, 4, 5 and 6
illustrate operation of a preferred embodiment of communications devices 101.

Figure 2 illustrates an initialization routine 200 performed by processing
engine 104 to generate speech sample 112. Initialization routine 200 is started by
determining at step 202 if the user is a new user. If the user is not new, meaning that
a speech sample 112 for that user already exists, then the routine is terminated at step
214. If the user is new, meaning that there is no speech sample 112 for the particular
user, then in step 204 the user is prompted to read sample text. For example, in step
204, sample text may be displayed on an output device 108. The sample text is
representative of commonly spoken sounds such as letters of the alphabet, integers
from zero through nine, days of the week, and months of the year. These sounds are
merely illustrative and other sounds can also be entered. For example, peculiarities
of a user's speech or accent can be accounted for by having the user read certain
words or phrases. The user can repeat certain, or all, text in various ways, such as at
fast and slow rates, to account for different speech patterns. Certain users are aware
of their own speech peculiarities and can therefore enter their own sample text and
read it back. However, in many cases it may be preferable to use various types of
sample text that are generated by those having particular knowledge of linguistics
and/or various accents and languages. For example, different speech samples can be
provided for men, women, and children. Different or additional sample text can be
provided for people with different accents.

Voice input from the user reading the sample text shown at step 204 is entered
into the communications device 101 by way of a microphone and is converted to
speech sample 112 at step 206, and then is stored at step 208 to storage device 106. At
step 210, processing engine 104 generates test speech using the stored speech sample
112 and provides the test speech by way of output device 108 in the form of an
audible signal. The user is then prompted to inform the communications device 101 if
the outputted speech accurately reflects the sample text. If so, then at step 212 the
speech sample 112 is determined to be acceptable and the routine is terminated at
step 214. If the user indicates at step 212 that the generated speech is unacceptable,
then steps 204, 206, 210 and 212 are repeated until an adequate speech sample 112 is
generated. The routine is then terminated at step 214.
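
The control flow of initialization routine 200 can be sketched as follows. This is an illustrative outline only: the function name and all of the callables passed in (record, recognize, synthesize, user_accepts) are hypothetical stand-ins for the microphone input, the speech recognition engine, the speech generator, and the user prompt. The sketch also assumes that the sample is stored again on each pass through the loop, which the description above does not state explicitly.

```python
def initialization_routine(user_id, store, record, recognize, synthesize,
                           user_accepts, sample_text):
    """Sketch of routine 200; every callable here is a hypothetical stand-in."""
    if user_id in store:                                # step 202: not a new user
        return store[user_id]                           # step 214: terminate
    while True:
        audio = record(sample_text)                     # step 204: prompt, capture reading
        sample = recognize(audio)                       # step 206: build speech sample 112
        store[user_id] = sample                         # step 208: store (assumed per pass)
        test_speech = synthesize(sample, sample_text)   # step 210: generate test speech
        if user_accepts(test_speech):                   # step 212: user judges the result
            return sample                               # step 214: terminate
```

Passing the device-specific operations in as parameters keeps the loop itself independent of any particular recognition engine or audio hardware.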

Generation of symbols indicative of the user's speech at step 206 is performed
by a speech recognition engine that converts a digitized signal indicative of a user's
voice into text or other types of symbols such as phonemes, which are fundamental
notations for sounds of speech. More specifically, phonemes are commonly described
as abstract units of the phonetic system of a language that correspond to a set of
similar speech sounds which are perceived to be a single distinctive sound in the
language. Speech recognition engines are commercially available. For example, the
ViaVoice product from IBM has a speech recognition engine that takes speech input
and generates text indicative of the speech. A developer's kit for this engine is also
available from IBM. This kit allows a speech recognition engine of the type in the
ViaVoice product to be used to generate text, phonemes or other types of output
indicative of the user's speech. Such an engine also has the capability to convert
speech to text or a similar representation. Such an engine can also produce
realistic-sounding speech by connecting synthesized or prerecorded phonemes.

Once the speech sample 112 has been stored, a call can be made using
communications device 101 to perform voice communication in accordance with the
principles of the present invention. A call is originated in accordance with the steps
shown in Figure 3, which shows an originate call routine 300. At step 302, the user
identifies the party to be called by selecting a recipient of the call from a list provided
by communications device 101, or by entering data such as a telephone number or
network address for the recipient. At step 304, communications device 101.1
establishes communications with the recipient, such as communications device 101.2,
shown in Figure 1. At step 304, configuration information and user preference
information are exchanged between the two communications devices 101. An
example of the configuration information or user preference information is
information indicating whether or not video conferencing or other services are
required. Further examples are rate of speech generation and optional display of
speech as text. The communications link established between the communications
devices 101 can be shared for other purposes such as video conferencing or remote
control. At step 306, a choice is provided to the user as to whether the recipient's
speech is to be rendered via simulated voice generation in accordance with the
principles of the present invention, or rendered using generic speech generation. If
generic speech generation is selected then, at step 310, conversation between the
calling party and receiving party is performed. Otherwise, at step 308, a test is
performed to determine if communications device 101.2 has a current copy of the
recipient's speech sample file 112.1. If so, then two-way voice communications are
initiated at step 310. Otherwise, at step 312 communications device 101.1 transmits
the speech sample file 112.1 to communications device 101.2 and conversation is
performed at step 310 until the call is terminated at step 314.
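
The decision made in steps 306 through 312 can be sketched as follows. This is a hedged illustration rather than the disclosed implementation: the function prepare_call, its parameters, and the use of a version number to decide whether the remote copy is "current" are all assumptions introduced for the example.

```python
def prepare_call(use_simulated_voice, remote_versions, user_id, local_version,
                 send_sample):
    """Decide whether the speech sample file must be sent before conversing.

    Returns True if the sample was transmitted (step 312), False otherwise.
    remote_versions is a hypothetical record of which sample version the
    remote device reported holding for each user.
    """
    if not use_simulated_voice:                          # step 306: generic speech chosen
        return False                                     # go straight to step 310
    if remote_versions.get(user_id) == local_version:    # step 308: copy is current
        return False                                     # go straight to step 310
    send_sample(user_id, local_version)                  # step 312: transmit sample file
    remote_versions[user_id] = local_version             # remember what the remote holds
    return True
```

Because the sample file is sent only when the remote copy is missing or out of date, repeated calls between the same two devices would pay the transfer cost once under this scheme.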

A similar sequence of functions is performed by receiving station 101.2, in
response to origination of a call by station 101.1. Steps 402, 404, 406, 408, 410, 412 and
414 of Figure 4 correspond to steps 302, 304, 306, 308, 310, 312 and 314, respectively, of Figure 3.
At step 402, communications device 101.2 responds to a phone ring or network
connection request initiated by device 101.1. At step 404, device 101.2 establishes
communications with the originating device 101.1 and exchanges configuration and
preference information at step 406. The user at device 101.2 is given an option of
conducting the conversation by way of generic speech generation or in accordance
with the principles of the present invention from speech samples 112. At step 408,
a determination is made whether the device 101.2 contains a current copy of the speech
sample 112.1 of the user of device 101.1. If so, then conversation is performed in step
410. Otherwise, at step 412, the speech sample 112.1 is transmitted to the
communications device 101.2 for use in the conversation. The conversation is
performed at step 410 and is subsequently terminated at step 414.

Figure 5 shows further details of steps 310 and 410 in Figures 3 and 4. At step
502, each processing engine 104.1 and 104.2 converts the received speech from the
user of the corresponding communications device into phonetically equivalent text in
accordance with the appropriate speech sample 112. Steps 502, 504 and 506 are
repeated until the conversation is determined to be over at step 508, at which point
the step 310 or 410 is terminated at step 510.

Each communications device also executes a listening routine shown in Figure
6 in addition to the talking routine shown in Figure 5. At step 602, the symbols
transmitted by the transmitting communications device are received and converted at
step 606 into simulated speech using the appropriate speech sample file 112.
Alternatively, the symbols received can be converted into text for visual display.
Steps 602, 604, and 606 are repeated until a determination is made at step 608 that the
conversation is over. The listening routine is then terminated at step 610.
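
The talking and listening routines can be illustrated with the following sketch. The description does not spell out the encoding of the transmitted symbols or the contents of steps 504, 506 and 604, so the one-byte-per-symbol packing, the shared symbol_table, and the function names talk and listen are assumptions made for the example. Recognition of speech into symbols (step 502) is taken as already done.

```python
def talk(symbols, symbol_table):
    """Talking side: pack recognized symbols into a compact byte stream.

    Assumes both devices share the same symbol_table (at most 256 entries)
    and that each symbol is sent as its index in that table.
    """
    return bytes(symbol_table.index(s) for s in symbols)


def listen(payload, symbol_table, speech_sample):
    """Listening side: rebuild the speaker's voice from received symbols.

    speech_sample maps each symbol to that speaker's recorded audio bytes.
    Returns simulated speech (audio concatenated per symbol) and a text form
    of the same symbols for optional visual display.
    """
    symbols = [symbol_table[index] for index in payload]
    audio = b"".join(speech_sample[s] for s in symbols)
    text = " ".join(symbols)
    return audio, text
```

In this sketch a four-symbol utterance travels as four bytes, while the audio played at the receiving device is assembled entirely from the speech sample already stored there.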

It is to be understood that the specific methods and apparatus which have been
described herein are merely illustrative of one application of the principles of the
invention and numerous modifications may be made to the subject matter disclosed
without departing from the true spirit and scope of the invention.

CLAIMS

WHAT IS CLAIMED IS:

1. Apparatus comprising: 

a processing engine responsive to a user's voice input for generating
speech sample data indicative of predetermined portions of said user's voice;

a storage device, responsive to said processing engine, for storing said
speech sample data;

said processing engine further comprising a communication module,
responsive to a communication session, for generating transmission data, indicative
of said user's voice spoken during said communication session, as a function of said
speech sample data, and for causing transmission of said transmission data to a
remotely located recipient of said communication session.

2. Apparatus as set forth in claim 1 wherein said processing engine 
encrypts said transmission data prior to transmission to said remotely located 
recipient. 

3. Apparatus as set forth in claim 1 wherein said speech sample data
comprises a plurality of alphabetic letters.

4. Apparatus as set forth in claim 1 wherein said speech sample data 
further comprises a plurality of single digit integers. 

5. Apparatus as set forth in claim 1 wherein said speech sample data
further comprises a plurality of phonemes.

6. Apparatus as set forth in claim 5 wherein said commonly spoken words 
comprise calendar days, weeks, and months. 

7. Apparatus as set forth in claim 1 wherein said processing engine
transmits said sample speech data to said recipient prior to transmission of said
transmission data.

8. A method for transmitting voice data, said method comprising:
generating speech sample data indicative of a sample of a user's voice in
response to voice inputs from said user;

responding to a request for a communication session by generating
voice transmission data as a function of said user's voice spoken during said
communication session; and

causing transmission of said voice transmission data to a receiving
station designated in said communication session.

WO 00/19412    PCT/US99/22630

9. A method as set forth in claim 8 comprising the further step of
encrypting said voice transmission data prior to transmission.

10. A method as set forth in claim 8 wherein said voice transmission data is
converted by said receiving station to audible sounds indicative of said user's spoken
voice.

11. A method as set forth in claim 8 wherein said voice transmission data is
converted by said receiving station to a visual representation indicative of said user's
spoken voice.

12. A method as set forth in claim 8 wherein speech sample data is also 
transmitted to said receiving station and wherein said voice transmission data is 
converted by said receiving station to an audible representation of said user's voice 
spoken during said communication session as a function of said speech sample data. 

13. A method as set forth in claim 8 wherein speech sample data is also
transmitted to said recipient and wherein said voice transmission data is converted
by said recipient to a visual representation of said user's voice spoken during said 
communication session as a function of said speech sample data. 

14. A method as set forth in claim 8 wherein said speech sample data
comprises a plurality of alphabetic letters.

15. A method as set forth in claim 14 wherein said speech sample data 
further comprises a plurality of single digit integers. 

16. A method as set forth in claim 8 wherein said speech sample data 
further comprises commonly spoken words and phonemes. 

17. A method as set forth in claim 16 wherein said commonly spoken words
comprise calendar days, weeks, and months.

18. A method as set forth in claim 8 wherein said communication session 
comprises transmission of said sample speech data to said recipient. 

19. A method as set forth in claim 8 comprising the further step of
receiving, from said receiving station, voice data in the form of signals corresponding
to a spoken voice of a user of said receiving station.

20. A method as set forth in claim 8 comprising the further step of
receiving, from said receiving station, voice transmission data indicative of words
spoken by a user of said receiving station and generating, as a function of speech
sample data indicative of a sample of said user of said receiving station, audible
sounds indicative of said words spoken by said user of said receiving station.

21. A method as set forth in claim 8 comprising the further step of
receiving, from said receiving station, voice transmission data indicative of words
spoken by a user of said receiving station and generating, as a function of speech
sample data indicative of a sample of said user of said receiving station, visual
representations of said words spoken by said user of said receiving station.

22. A computer-readable storage medium comprising a set of computer 
programming instructions for causing two-way voice communication over a shared 
communications medium, the set of computer programming instructions comprising: 

a voice sampling module for generating speech sample data as a
function of a user's spoken voice; and

a voice conversion module, responsive to establishment of a
communication session between said user and a second party, for converting said
user's spoken voice, as a function of said speech sample data, to voice transmission
data, and for causing transmission to said second party via a remote site coupled to
said shared communications medium.

23. A computer-readable storage medium as set forth in claim 22 wherein 
said voice sampling module comprises means for causing said speech sample data to 
be converted to audible outputs for review by said user. 